Admixture simple calibration

Setup

Useful functions

Read data

Data summaries

```r
hapflk %>%
  dplyr::select(replicate, s, m) %>%
  distinct() %>%
  mutate(replicate = factor(replicate)) %>%
  group_by(s, m) %>%
  summarise(analysed_simulations = n())
```

```r
plt <- hapflk %>%
  dplyr::select(pos, replicate, s, m) %>%
  distinct() %>%
  group_by(replicate, s, m) %>%
  summarise(loci = n()) %>%
  ggplot() +
  geom_histogram(aes(x = loci, color = s), fill = NA) +
  wrap_by(s, m) +
  theme_light() +
  scale_color_manual(values = c("black", "#e31a1c"))
plt
```

```r
frequencies %>%
  filter(generation == max_gen) %>%
  filter(pop == "p4") %>%
  ggplot(aes(x = s, y = freq)) +
  geom_violin(fill = NA) +
  geom_sina(alpha = 0.3) +
  ylim(c(0, 1)) +
  theme_light()
```

Some simulations lost m2 in p4, so I keep only those in which the mutation was not lost.
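The filtering step can be sketched as follows (a minimal base-R sketch; the column names `replicate`, `pop`, `generation`, and `freq`, and the toy data, are assumptions for illustration):

```r
# Toy stand-ins for the `frequencies` and `hapflk` tables (assumed columns)
frequencies <- data.frame(
  replicate  = c(1, 1, 2, 2),
  pop        = "p4",
  generation = 10,
  freq       = c(0.8, 0.8, 0.0, 0.0)  # replicate 2 lost m2
)
max_gen <- 10
hapflk <- data.frame(replicate = c(1, 1, 2), hapflk = c(5.2, 7.1, 4.3))

# Replicates where the selected mutation m2 survived in p4
kept <- unique(frequencies$replicate[
  frequencies$pop == "p4" &
  frequencies$generation == max_gen &
  frequencies$freq > 0
])

# Keep only hapFLK results from surviving replicates
hapflk_kept <- hapflk[hapflk$replicate %in% kept, ]
```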

Let's see how the number of replicates used changed.

The population under selection is p4; in the m=0 scenario, the adaptive mutation was probably lost before the sweep.

Statistics QC

Statistics distribution

View the distribution of FLK values for the no-admixture scenario (m=0) and the admixture scenario (m=0.3).

Now compute values of hapFLK for no migration and migration.

We can compare the distributions more clearly in a superimposed plot: first for the FLK statistic and then for hapFLK.
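One way to superimpose the two distributions with base R (a sketch on simulated stand-in data; in the real analysis the two vectors would be the FLK values split by m):

```r
set.seed(1)
# Stand-ins for FLK values under m = 0 and m = 0.3 (simulated for illustration)
flk_m0  <- rchisq(5000, df = 3)
flk_m03 <- rchisq(5000, df = 3) * 1.2  # admixture inflating the statistic

d0  <- density(flk_m0)
d03 <- density(flk_m03)

# Overlay the two kernel density estimates on one plot
plot(d0, col = "black", main = "FLK density by migration rate",
     xlab = "FLK", ylim = range(d0$y, d03$y))
lines(d03, col = "#e31a1c")
legend("topright", legend = c("m = 0", "m = 0.3"),
       col = c("black", "#e31a1c"), lty = 1)
```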

All covariance estimation methods look similar except for treemix, which appears to be a scaled version of the others.

Statistics goodness-of-fit

Chi-squared distribution

qqplots are not very informative when superimposed and faceted this way, so let's plot them by m and s, and facet them by covariance matrix in a square layout. First, define a function to make the code more compact.
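The helper might look something like this (a sketch; the function name and interface are made up for illustration, and base-R plotting stands in for the faceted ggplot version):

```r
# Pair sorted empirical values with theoretical chi-squared quantiles,
# ready to be plotted (and faceted by m, s, and covariance matrix) downstream
qq_chisq <- function(x, df) {
  p <- ppoints(length(x))             # plotting positions
  data.frame(
    theoretical = qchisq(p, df = df),
    empirical   = sort(x)
  )
}

set.seed(2)
qq <- qq_chisq(rchisq(1000, df = 6), df = 6)

# Points should lie close to the y = x line when the fit is good
plot(qq$theoretical, qq$empirical,
     xlab = "Theoretical quantiles", ylab = "Empirical quantiles")
abline(0, 1, col = "#e31a1c")
```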

Now compute goodness-of-fit for hapFLK values.

Let’s explore in more detail how good the statistics fit to a χ2 by plotting joint density plots for the empirical and theoretical distributions.

Now, for hapflk statistic.

The hapFLK statistic does not fit so well.

Assuming the fit is good, let's compute p-values and q-values of hapFLK for each run, plot their distributions, and compare them in Manhattan plots.
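The p-value step can be sketched like this (a base-R sketch on simulated stand-in values; `p.adjust(..., method = "BH")` stands in for the `qvalue` package, and the degrees of freedom are illustrative):

```r
set.seed(3)
df_null <- 9                              # e.g. (npop - 1) * (K - 1), assumed
hapflk_values <- rchisq(2000, df = df_null)  # stand-in for one run's hapFLK

# Upper-tail p-values under the chi-squared null
pval <- pchisq(hapflk_values, df = df_null, lower.tail = FALSE)

# Benjamini-Hochberg adjusted p-values as a simple q-value proxy
qval <- p.adjust(pval, method = "BH")

hist(pval, breaks = 20, main = "p-value histogram", xlab = "p-value")
```

Under a well-calibrated null, this histogram should look flat.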

Draw p-value histograms

Let's look at the p-value Manhattan plots.
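A minimal base-R Manhattan plot sketch (toy positions and p-values, with a signal injected mid-chromosome for illustration):

```r
set.seed(4)
pos  <- seq_len(1000)                 # toy genomic positions
pval <- runif(1000)                   # null p-values
pval[480:520] <- pval[480:520] / 1e4  # injected selection peak

plot(pos, -log10(pval), pch = 20, cex = 0.5,
     xlab = "Position", ylab = expression(-log[10](p)))
# Bonferroni-corrected significance line
abline(h = -log10(0.05 / length(pval)), col = "#e31a1c", lty = 2)
```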

Normal distribution

As with the qqplots for the chi-squared distribution, define a function for faceting by m and s.

I could visualize the Kernel Density Estimates together with the theoretical distributions. But I won't do it right now.

Assuming hapFLK follows a normal distribution, compute p-values and q-values. Make p values histograms and Manhattan plots.
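The normal-null computation can be sketched as follows (simulated stand-in values; BH adjustment again stands in for q-values):

```r
set.seed(5)
x <- rnorm(2000, mean = 10, sd = 2)   # stand-in for hapFLK values

# Standardise with the sample mean/sd, then take upper-tail normal p-values
pval <- pnorm(x, mean = mean(x), sd = sd(x), lower.tail = FALSE)
qval <- p.adjust(pval, method = "BH")

hist(pval, breaks = 20, main = "p-values under a normal null",
     xlab = "p-value")
```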

Manhattan plots now (only for selection scenarios)

The above p-values are computed using the mean and standard deviation, which are sensitive to the presence of outliers. We can estimate the mean and standard deviation more robustly using the rlm() function for robust linear model fitting.
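Fitting an intercept-only robust model gives both estimates at once: the intercept is a robust location and the `s` component of the fit is a robust scale. A sketch on simulated data with outliers:

```r
library(MASS)  # recommended package shipped with R; provides rlm()

set.seed(6)
# Stand-in hapFLK values with a few large outliers
x <- c(rnorm(1000, mean = 10, sd = 2), rnorm(10, mean = 60, sd = 2))

fit <- rlm(x ~ 1)            # intercept-only robust fit (M-estimation)
mu_robust <- coef(fit)[[1]]  # robust mean
sd_robust <- fit$s           # robust scale estimate

# The outliers inflate the naive estimates but barely move the robust ones
c(naive_mean = mean(x), robust_mean = mu_robust,
  naive_sd = sd(x), robust_sd = sd_robust)
```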

Compute p-values

Draw histograms

Manhattan plots for selection

Raw statistics Manhattan plots

Chi squared rescaling for hapFLK

hapFLK values will be rescaled to the theoretical χ2 distribution (with (npop−1)(K−1) degrees of freedom) using robust quantile regression.
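One way to sketch this: pair the sorted empirical values with the theoretical χ2 quantiles, fit a robust line through the pairs, and invert it to map the statistic back onto the χ2 scale. This uses `MASS::rlm` on the quantile pairs as a stand-in for a dedicated quantile-regression routine; the degrees of freedom and mis-scaling are illustrative.

```r
library(MASS)

set.seed(7)
df_target <- 9                                # (npop - 1) * (K - 1), assumed
x <- 3 + 0.5 * rchisq(3000, df = df_target)   # mis-scaled stand-in values

# Pair empirical and theoretical quantiles, then fit a robust line
p    <- ppoints(length(x))
theo <- qchisq(p, df = df_target)
emp  <- sort(x)
fit  <- rlm(emp ~ theo)

# Invert the fitted line to map values back onto the chi-squared scale
a <- coef(fit)[[1]]
b <- coef(fit)[[2]]
x_scaled <- (x - a) / b
```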

Let's compare the rescaled density estimates with unscaled estimates, together with the theoretical chi-squared distribution.

For no migration:

For migration:

Scaling seems to work pretty well; let's draw p-value histograms and Manhattan plots to see whether everything behaves as expected.

p-value histograms for m=0

p-value histograms for m=0.3

Manhattan plots for m=0

Manhattan plots for m=0.3

Normal robust rescaling of hapFLK values

Let's visualize and compare the scaled densities with the unscaled ones.

First, m=0

Now, m=0.3

What about p-value distributions?

For m=0

For m=0.3

And now, for everyone's delight, Manhattan plots!

For m=0

For m=0.3

Power analysis

Define functions for power analysis and plotting
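The core computation can be sketched as follows: given p-values and a logical vector marking truly selected loci, sweep thresholds and record the false positive rate against the true positive rate (power). The function name, interface, and toy data are made up for illustration.

```r
roc_points <- function(pval, is_signal) {
  thresholds <- sort(unique(pval))
  t(sapply(thresholds, function(th) {
    call <- pval <= th
    c(fpr = mean(call[!is_signal]),   # false positive rate
      tpr = mean(call[is_signal]))    # true positive rate (power)
  }))
}

set.seed(8)
# 100 selected loci with small p-values, 900 neutral loci with uniform ones
is_signal <- rep(c(TRUE, FALSE), times = c(100, 900))
pval <- c(rbeta(100, 0.2, 1), runif(900))

roc <- roc_points(pval, is_signal)
plot(roc[, "fpr"], roc[, "tpr"], type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # performance of a random classifier
```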

Plot ROCs for m=0

Plot ROCs for m=0.3

Looking into anomalies

Treemix vs others

I want to see what the relationship is between the treemix-estimated hapFLK values and those from the other covariance matrices. To do this, I will make scatterplots and fit linear regressions of treemix against each of the other hapFLK values. If it is just a matter of scale, treemix will have a linear relationship with the other statistics.
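The check can be sketched like this (simulated stand-ins for one pair of covariance methods; the variable names and the 0.8 scaling factor are illustrative):

```r
set.seed(9)
# If treemix is just a rescaled version of the others, the points should
# fall on a line and the regression should explain almost all the variance
hapflk_kinship <- rchisq(1000, df = 9)
hapflk_treemix <- 0.8 * hapflk_kinship + rnorm(1000, sd = 0.1)

fit <- lm(hapflk_treemix ~ hapflk_kinship)
r2  <- summary(fit)$r.squared   # close to 1 for a purely linear relationship

plot(hapflk_kinship, hapflk_treemix, pch = 20, cex = 0.5,
     xlab = "hapFLK (kinship)", ylab = "hapFLK (treemix)")
abline(fit, col = "#e31a1c")
```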

p-value distributions

p-value distributions for the chi-squared rescaled values were pretty weird, with a high bar close to one. I want to revisit the code and see what that is about.

The problem was that I had not filtered the scaled and unscaled categories.

Frequency at selection